Case Study 1: Audience Size

Executive Summary

The intent of this executive summary is to answer four questions.

  1. Who listens to Sirius Radio?
  2. Who listens to Sirius Business Radio by Wharton?
  3. Does this sample appear to be a random sample from the general population?
  4. Does this sample appear to be a random sample from the MTURK population?

Based on the answers to these questions, we then address two follow-up questions.

  1. What method should be utilized to estimate the audience size?
  2. What data should be collected and where should it come from?

The variables selected for analysis were age, education, gender, income, sirius, wharton, and worktime. Survey questions without a response were recoded to NA. Responses entered in the wrong format were converted to the appropriate format. For example, age should have been a numeric input, but some respondents wrote it out as a word (‘eighteen’ rather than 18). Other responses were implausible: one respondent typed 223, which is far beyond a normal human life span. Because we cannot assume the respondent meant 22, 23, or 32, this response was changed to NA.
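The cleaning rules described above can be sketched as follows. This is a minimal illustration, not the code used for the report: the word lookup, the plausibility bounds, and the function name `clean_age` are all assumptions, and Python's `None` stands in for NA.

```python
import re

# Hypothetical lookup for spelled-out ages; extend as needed.
WORD_TO_NUMBER = {"eighteen": 18, "nineteen": 19, "twenty": 20}

# Assumed plausibility bounds; the report does not state its own cutoffs.
MIN_AGE, MAX_AGE = 0, 120


def clean_age(raw):
    """Return an integer age, or None (standing in for NA) if unusable."""
    if raw is None:
        return None
    text = str(raw).strip().lower()
    if not text:
        return None  # empty response becomes NA
    if text in WORD_TO_NUMBER:
        age = WORD_TO_NUMBER[text]  # spelled-out age, e.g. 'eighteen'
    elif re.fullmatch(r"\d+", text):
        age = int(text)
    else:
        return None  # unrecognized format becomes NA
    # Implausible values such as 223 are not guessed at; they become NA.
    return age if MIN_AGE <= age <= MAX_AGE else None


responses = ["27", "eighteen", "223", "", None, " 34 "]
cleaned = [clean_age(r) for r in responses]
print(cleaned)  # [27, 18, None, None, None, 34]
```

Returning NA instead of guessing the intended value matches the decision above not to assume that 223 meant 22, 23, or 32.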

Summary Tables

  1. Who listens to Sirius Radio?
  2. Who listens to Sirius Business Radio by Wharton?
  3. Does this sample appear to be a random sample from the general population?
  4. Does this sample appear to be a random sample from the MTURK population?

The first four questions can be answered by analyzing the following tables and figures.

Educational attainment

The summary table is split by educational attainment and shows that most listeners have either some college (no diploma, or an Associate’s degree) or a bachelor’s degree. For individuals with some college, the mean age is 28.52 and the median is 26. For individuals who hold a bachelor’s degree, the mean age is 31.14 and the median is 28.

Education                                       Frequency  Mean_Age  Median_Age  Mean_Worktime  Median_Worktime
Less than 12 years; no high school diploma      10         30.00000  27.0        23.90000       22.0
High school graduate (or equivalent)            191        30.21466  27.0        23.32984       21.0
Some college, no diploma; or Associates degree  737        28.51967  26.0        22.49389       21.0
Bachelors degree or other 4-year degree         612        31.14379  28.0        21.96078       20.0
Graduate or professional degree                 177        34.57627  31.0        23.22034       21.0
Other                                           2          50.50000  50.5        46.50000       46.5
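A summary table of this shape (frequency, mean, and median per group) could be computed along the following lines. This is a sketch under stated assumptions: the records are made up, the helper `summarize` is hypothetical, and counting NA rows in the frequency while excluding them from the mean and median is a choice the report does not confirm.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical respondent records; None stands in for NA.
rows = [
    {"education": "Bachelors", "age": 28, "worktime": 20},
    {"education": "Bachelors", "age": 34, "worktime": 24},
    {"education": "Some college", "age": 26, "worktime": 21},
    {"education": "Some college", "age": None, "worktime": 19},
]


def summarize(rows, group_key, value_keys):
    """Group rows by group_key and report frequency, mean, and median."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    summary = {}
    for group, members in groups.items():
        stats = {"Frequency": len(members)}  # counts rows, including NA ages
        for key in value_keys:
            # Assumption: NA values are dropped before averaging.
            values = [m[key] for m in members if m[key] is not None]
            stats[f"Mean_{key}"] = mean(values) if values else None
            stats[f"Median_{key}"] = median(values) if values else None
        summary[group] = stats
    return summary


table = summarize(rows, "education", ["age", "worktime"])
for group, stats in table.items():
    print(group, stats)
```

The same helper can be reused for the gender and income breakdowns by changing `group_key`.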

Age

For a closer look at the age of listeners, a frequency table is included that shows the gender breakdown along with the mean and median age of male and female listeners.

Gender  Frequency  Mean_Age  Median_Age  Mean_Worktime  Median_Worktime
Female  729        31.9561   29          23.52812       21
Male    1000       29.0750   27          21.76400       20

Histograms of age, split by gender and income, also show the age distribution.

Income

A table is also included below showing the income level of survey participants. The most common income bracket was $30,000 - $50,000 (421 respondents). In this bracket, the mean age was 30.74 and the median was 28.

Income              Frequency  Mean_Age  Median_Age  Mean_Worktime  Median_Worktime
Less than $15,000   206        26.57767  24          21.19903       20.0
$15,000 - $30,000   360        29.45833  27          23.09167       21.0
$30,000 - $50,000   421        30.73634  28          22.85036       20.0
$50,000 - $75,000   372        31.46774  29          22.15860       20.0
$75,000 - $150,000  326        31.59202  30          22.80368       21.0
Above $150,000      44         30.59091  24          21.34091       20.5

Wharton and Sirius?

Box Plots

The box plot below, faceted by educational attainment and income, describes the users who listen to both Wharton radio and Sirius radio. Each box plot is split by gender and shows the age of the listeners.

Conclusion: Who listens to Wharton Radio?

Based on the charts above, the average listener tends to be in their mid-20s to early 30s. Listeners usually have at least some college education and are more often male than female. They generally earn $30,000 - $50,000 a year. With the general audience base established, the next two questions can be addressed.

  1. What method should be utilized to estimate the audience size?
  2. What data should be collected and where should it come from?

Method for estimating Audience Size?

This dataset generally represents a broad population. Based on this data, we can estimate the audience size by filtering for people who listen to Sirius XM and, among them, people who listen to Wharton’s podcast. This, however, gives the aggregate audience size rather than individual downloads per podcast.
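One possible reading of this filtering approach is a simple ratio estimate: take the share of Sirius listeners in the sample who also listen to Wharton, and scale it by an externally known Sirius audience. The sketch below makes that assumption explicit. The sample rows, the function name, and the `SIRIUS_AUDIENCE` value are placeholders, not figures from the survey or from Sirius XM.

```python
# Hypothetical survey rows: whether each respondent reports listening.
sample = [
    {"sirius": "Yes", "wharton": "Yes"},
    {"sirius": "Yes", "wharton": "No"},
    {"sirius": "Yes", "wharton": "No"},
    {"sirius": "No", "wharton": "No"},
    {"sirius": "Yes", "wharton": "No"},
]


def estimate_wharton_audience(sample, sirius_audience):
    """Scale the in-sample Wharton share of Sirius listeners to a known total."""
    sirius_listeners = [r for r in sample if r["sirius"] == "Yes"]
    if not sirius_listeners:
        raise ValueError("no Sirius listeners in sample")
    both = [r for r in sirius_listeners if r["wharton"] == "Yes"]
    share = len(both) / len(sirius_listeners)
    return share, share * sirius_audience


SIRIUS_AUDIENCE = 1_000_000  # placeholder total, not a real figure
share, estimate = estimate_wharton_audience(sample, SIRIUS_AUDIENCE)
print(f"share={share:.2f}, estimated audience={estimate:,.0f}")
```

The estimate is only as good as the assumption that Sirius listeners in the sample resemble Sirius listeners in general, which is exactly what questions 3 and 4 above are probing.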

What data should be collected and where should it come from?

Rather than collecting survey data through MTURK, a more direct view of the listener base can be obtained by pulling data from Stitcher, which has been acquired by Sirius XM. This data may be more specific and may provide helpful information such as listener age and ratings. Further, the data should describe gender identity and sex as a spectrum, to be inclusive of individuals who do not identify as male or female, as this could suggest future topics for Wharton professors to address.

Case Study 2: Women in Science

Questions

  1. How many fields are reported?
  2. What types of degrees are reported?
  3. How many years of statistics are reported?

In this data set there are 10 different fields of study, 3 different degree levels, and 11 years’ worth of data, from 2006 to 2016. This can be seen in the summary tables below.

By Field and Sex

This table shows the 10 fields of study and the total number of male and female students in each. Across the 10 fields, there is an even split between female-dominated and male-dominated fields (five each).

Field                                   Male     Female    More
Agricultural sciences                   152956   172852    Female
Biological sciences                     516556   745384    Female
Computer sciences                       626248   171808    Male
Earth, atmospheric, and ocean sciences  53118    36076     Male
Engineering                             1147112  301426    Male
Mathematics and statistics              173641   125083    Male
Non-S&E                                 7356720  11916324  Female
Physical sciences                       194365   120797    Male
Psychology                              326865   1109902   Female
Social sciences                         1041377  1244592   Female
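The "More" column and the female share of each field follow directly from the two counts. The sketch below uses three rows copied from the table above. The helper name and the tie-breaking rule (a tie is labeled Male) are assumptions made for illustration.

```python
# Totals copied from the table above (a subset of fields).
totals = {
    "Computer sciences": {"Male": 626248, "Female": 171808},
    "Mathematics and statistics": {"Male": 173641, "Female": 125083},
    "Psychology": {"Male": 326865, "Female": 1109902},
}


def majority_and_female_share(counts):
    """Return the larger group and the fraction of degrees earned by women."""
    # Assumption: an exact tie would be labeled "Male".
    more = "Female" if counts["Female"] > counts["Male"] else "Male"
    share = counts["Female"] / (counts["Female"] + counts["Male"])
    return more, share


results = {field: majority_and_female_share(c) for field, c in totals.items()}
for field, (more, share) in results.items():
    print(f"{field}: more={more}, female share={share:.1%}")
```

Reporting the share alongside the label shows how lopsided each field is, which the Male/Female label alone hides.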

By Degree and Sex

This table shows the 3 degree types over the 11 years, broken down by sex. Although more women than men received Bachelor’s and Master’s degrees overall, men attained slightly more Ph.D.s.

Degree  Male     Female    More
BS      8137972  10930199  Female
MS      3104414  4669113   Female
PhD     346572   344932    Male

By Year and Sex

Finally, the last table shows the number of degrees attained each year by males and females. Females earned more degrees in every year, and the number of degrees has steadily increased over time.

Year  Male     Female   More
2006  904679   1253917  Female
2007  925621   1287439  Female
2008  953360   1320480  Female
2009  985411   1360820  Female
2010  1019514  1404646  Female
2011  1063992  1466539  Female
2012  1107721  1525402  Female
2013  1130821  1552075  Female
2014  1147769  1570559  Female
2015  1163164  1586060  Female
2016  1186906  1616307  Female

BS degrees in 2015

The summary statistics tables above are high level summaries. To understand the breakdown of science related fields vs non-science fields in 2015, separate bar plots have been made.

2015 Overall Science and Engineering

This bar plot shows the breakdown of science and engineering fields and compares them to the broader category of non-science and engineering fields.

To show the difference by sex between Non-S&E fields and S&E fields in 2015, the following bar plot has been made. As the tables above show, more women than men obtain degrees overall, which is highlighted by the larger Non-S&E bar. However, within S&E, the bars for Male and Female are similar in frequency.

Questions

In general, the summary tables show that women earn more degrees overall; however, when broken down by science-related field, women more often study social sciences, psychology, biological sciences, and agricultural sciences, while men more often study physical sciences, mathematics and statistics, engineering, earth, atmospheric, and ocean sciences, and computer sciences. This can be seen in the box plot below. This is consistent with literature showing that these fields are often not inclusive of women, and in particular women of color.

To see whether women are represented in data science, the plot has been filtered to include only computer sciences and mathematics and statistics. Despite the foundations that women have established in mathematics, statistics, and computer science, women are underrepresented in both fields.

Conclusion

Based on this dataset, there is a consistently lower proportion of women in science-related fields across years. However, it remains unclear which fields make up the Non-S&E category, or why the nine fields above were chosen to represent “science-related” fields. Other fields, such as chemical sciences and medical and health sciences, among others, belong to science as well. Future studies could improve on this by asking participants to clarify their degree fields during data collection.

Case Study 3: Major League Baseball

For this study we conducted exploratory data analysis to describe the change in payroll from year to year. Rather than taking the raw difference in payroll, we used the log difference. For example, rather than computing 2013 payroll - 2012 payroll, we computed log(2013 payroll) - log(2012 payroll). We used the log difference because it approximates the percentage change in payroll.
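A short numeric illustration of the log difference, using made-up payroll figures (not values from the data set) and assuming natural logarithms:

```python
import math

# Hypothetical payroll figures (in millions); not taken from the data set.
payroll_2012 = 100.0
payroll_2013 = 110.0

# Log difference, as described above.
log_diff = math.log(payroll_2013) - math.log(payroll_2012)

# Ordinary percentage change, for comparison.
pct_change = (payroll_2013 - payroll_2012) / payroll_2012

print(f"log difference: {log_diff:.4f}")    # 0.0953
print(f"percent change: {pct_change:.4f}")  # 0.1000
```

For small changes the two numbers are close. The log difference also has the convenient property that it depends only on the ratio of the two payrolls, so a team going from 50 to 55 and a team going from 100 to 110 get the same value.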

The table below shows the total change in payroll from 2010 to 2014, inclusive. The column Overall_Change_Payroll (shown as "Sum of log change in payroll") is the sum of the yearly log changes in payroll over 2010-2014. By this measure, the Los Angeles Dodgers had the greatest change in payroll. The table lists the five teams with the largest changes.

Team                  Sum of log change in payroll  Average_Payroll  Aggregate_Wins  Aggregate_Wins_Pct
Los Angeles Dodgers   0.8511002                     149.15418        434             2.682156
Washington Nationals  0.8200017                     91.04086         429             2.651215
San Diego Padres      0.7443947                     59.23019         390             2.407407
Texas Rangers         0.6839576                     103.63741        437             2.694085
San Francisco Giants  0.6294727                     125.62321        436             2.691358
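One property of a sum of yearly log changes is worth making explicit: the intermediate years cancel, so the sum equals the log of the last payroll minus the log of the first. The sketch below shows this with hypothetical payrolls. It assumes the sum runs over consecutive-year differences within the period, which the report does not spell out.

```python
import math

# Hypothetical yearly payrolls for one team (in millions), 2010 through 2014.
payroll = {2010: 90.0, 2011: 100.0, 2012: 95.0, 2013: 120.0, 2014: 150.0}

years = sorted(payroll)

# Year-over-year log differences: 2011 vs 2010, 2012 vs 2011, and so on.
log_diffs = [
    math.log(payroll[curr]) - math.log(payroll[prev])
    for prev, curr in zip(years, years[1:])
]

total_change = sum(log_diffs)
endpoint_change = math.log(payroll[2014]) - math.log(payroll[2010])

# The two agree up to floating-point error.
print(abs(total_change - endpoint_change) < 1e-9)  # True
```

Read this way, and assuming natural logarithms, the Dodgers' value of 0.8511002 would correspond to payroll growing by a factor of about exp(0.8511), roughly 2.34, over the period.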

The graph below shows the change in log payroll difference from year to year.

The table below shows the difference in win percentage between 2010 and 2014, from greatest to least. The Arizona Diamondbacks had the greatest difference in log win percentage between 2010 and 2014.

Team                  Difference of log win Percent  Aggregate_Wins
Arizona Diamondbacks  0.1790123                      385
Boston Red Sox        0.1728395                      416
Cleveland Indians     0.1481481                      394
Baltimore Orioles     0.1481481                      409
Los Angeles Angels    0.1234568                      431

The teams with the largest total change in payroll have been highlighted in the chart below.

Yearly payroll or yearly increase in payroll

We would argue that an increase in payroll on the log scale does not lead to better performance. In the table of overall change in payroll, there is very little difference in aggregate win percentage among the top 5 teams, despite a wide range in average payroll.

To compare whether payroll or the log increase in payroll better explains performance, we regressed win percentage on payroll and on log_Payroll_Diff and compared the R-squared values. Payroll gave an R-squared of 0.1173386 and the log payroll increase gave an R-squared of 0.0089299. Based on these two values, we determined that payroll is more effective in explaining performance.
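The comparison above can be sketched as two separate one-predictor regressions, each scored by its R-squared. This assumes the two reported values come from separate simple regressions. The data below are invented for illustration and will not reproduce the reported values, and `r_squared` is a hypothetical helper.

```python
from statistics import mean


def r_squared(x, y):
    """R-squared of a one-predictor least-squares fit of y on x.

    For simple linear regression this equals the squared correlation.
    """
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return (sxy ** 2) / (sxx * syy)


# Hypothetical team-season values, for illustration only.
win_pct = [0.45, 0.50, 0.55, 0.60, 0.52]
payroll = [60.0, 80.0, 100.0, 150.0, 90.0]
log_payroll_diff = [0.10, -0.05, 0.02, 0.08, 0.00]

r2_payroll = r_squared(payroll, win_pct)
r2_log_diff = r_squared(log_payroll_diff, win_pct)

print(f"R-squared, payroll:          {r2_payroll:.4f}")
print(f"R-squared, log payroll diff: {r2_log_diff:.4f}")
```

The predictor with the larger R-squared explains more of the variation in win percentage, which is the basis for the conclusion above. Note that both reported values are low, so neither predictor explains much on its own.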